This document can be viewed at: https://timzai.github.io/siads532_strava/
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from itertools import permutations
from itertools import combinations
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
# pio.renderers.default = 'iframe'
%matplotlib inline
%load_ext watermark
%watermark -v -m -p pandas,numpy,matplotlib,plotly
Rule et al.'s Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks
The rules below will be followed throughout this assignment:
Alright, so I haven't really had any experience with fitness data, nor have I done much running or cycling, so I want to start by exploring what the data is about before deciding how to proceed.
I'll start by loading the data and sampling from it.
all_data = pd.read_csv("assets/strava.csv")
print(len(all_data))
display(all_data.sample(5))
Alright, so we've got 22 columns, including a timestamp column that we can parse as datetime. Notice that there are already a ton of NaNs; this is expected because Chris said that a few different sensors were used, and they may not have been present during all exercises.
Let's do a quick summary of the data using .dtypes, .isna().sum(), and .nunique().
all_data = pd.read_csv("assets/strava.csv", parse_dates=['timestamp'], infer_datetime_format=True)
all_data_summary = pd.DataFrame([all_data.dtypes,all_data.isna().sum(),all_data.nunique()], index=['dtypes','isna().sum()','nunique()']).T
display(all_data_summary)
So a few things that I could quickly infer from this:
So a few things that I could quickly infer from this:

- nunique() is the same as len() of all_data for some columns
- lots of NaN data, like the earlier observation
- mostly floats, so a lot of numeric analysis can be done already

Since there are a lot of NaNs, I think it'll be a good idea to see how much each column's non-NaN rows overlap with the others'. I'll use a scatter plot to represent the matrix of column combinations, with color representing the overlap count. This should allow me to easily visualize which columns have a higher count of overlap with other columns.
overlaps = pd.DataFrame(permutations(all_data.columns,2),columns=['x','y'])
overlaps['num_rows'] = overlaps.apply(lambda r: (all_data[r.x].notna() & all_data[r.y].notna()).sum(), axis=1)
overlaps.sample(5)
sc = plt.scatter(overlaps.x, overlaps.y, c=overlaps.num_rows)
plt.colorbar(sc)
plt.show()
Okay! This is totally what I wanted - except that "Air Power" is weirdly situated, the axis labels are a mess and I think that I can probably remove a few columns that I definitely won't be using, or remove one of lat/long since those appear to come in pairs. I'm already seeing some block-like patterns indicating that certain columns probably appear with others more frequently (same sensors perhaps!)
alt_columns = list(all_data.columns)
# these exist for all entries, so we can remove them (distance is still there)
alt_columns.remove("datafile")
alt_columns.remove("timestamp")
# same as position_lat
alt_columns.remove("position_long")
# I don't know what these are either
alt_columns.remove("unknown_87")
alt_columns.remove("unknown_88")
alt_columns.remove("unknown_90")
# because enhanced is better
alt_columns.remove("speed")
alt_columns.remove("altitude")
# there is cadence
alt_columns.remove("fractional_cadence")
overlaps2 = pd.DataFrame(permutations(alt_columns,2),columns=['x','y'])
overlaps2['num_rows'] = overlaps2.apply(lambda r: (all_data[r.x].notna() & all_data[r.y].notna()).sum(), axis=1)
sc2 = plt.scatter(overlaps2.x, overlaps2.y, c=overlaps2.num_rows)
plt.colorbar(sc2)
plt.show()
OK, that looks cleaner, but I'm thinking I should use another library so I can hover over the points to see their information. It'll be much easier to pick from the matrix interactively. I'll use Plotly for this.
fig = px.scatter(overlaps2, x="x", y="y", color="num_rows", color_continuous_scale=px.colors.sequential.matter)
fig.update_yaxes(title=None, categoryarray=alt_columns)
fig.update_xaxes(title=None)
fig.update_traces(marker=dict(size=16))
fig.show()
OK, that is exactly what I wanted. I can see that we won't be able to analyze certain combinations due to some of the data not overlapping much, but the orange and dark purple dots are all valid choices for further cross analysis.
But wait! Although we have a pretty good understanding of which columns came from the same sensors, let's triple-check by looking at overlaps for 3 columns at once! I think 3D is a good fit here, since we can move the viewpoint around to see whether specific column combinations yield more overlap than others.
overlaps3 = pd.DataFrame(permutations(alt_columns,3),columns=['x','y','z'])
overlaps3['num_rows'] = overlaps3.apply(lambda r: (all_data[r.x].notna() & all_data[r.y].notna() & all_data[r.z].notna()).sum(), axis=1)
overlaps3 = overlaps3[overlaps3['num_rows'] > 5000]
fig = px.scatter_3d(overlaps3, x="x", y="y", z="z", color="num_rows", range_color=(0, overlaps3.num_rows.max()),
                    color_continuous_scale=[[0, 'rgba(255, 255, 50, 0.5)'], [0.5, 'rgba(208, 71, 85, 1)'], [1.0, 'rgba(48, 15, 62, 1)']])
fig.update_layout(scene=dict(
    xaxis=dict(title=None, categoryarray=alt_columns),
    yaxis=dict(title=None, categoryarray=alt_columns),
    zaxis=dict(title=None, categoryarray=alt_columns)))
fig.update_traces(marker=dict(size=4))
fig.show()
I ended up hiding the low counts to come up with a readable 3D plot. It clearly shows that there are pretty much 2 clusters of data with lots of overlap. The 3D plot wasn't as cool as I'd hoped it would be, but it was what I expected - the points from the sensors, added together, resembling cube-shaped blocks of data.
Now, I already did SPLOMs for assignment 3, but I felt they were super helpful for identifying trends with an open mind. I'll do a few SPLOMs with nothing particular in mind, but I'd like some reusable code to generate them easily.
For the initial SPLOMs, I'll look into the heart_rate, distance, and speed data separately first, per exercise session. I'll also add in duration and timestamp so we can explore whether there are trends related to when Chris exercises and whether he made overall improvements as time passed.
def load_data(c):
    # Loads data with specific columns in mind
    cols = list(c.keys())
    df = pd.read_csv("assets/strava.csv", parse_dates=['timestamp'], infer_datetime_format=True)
    df = df[['timestamp','datafile']+cols].dropna(subset=cols)
    return df
def summarize_data(d):
    # Aggregate based on datafile (which indicates 1 exercise session) with columns and aggregate functions.
    df = load_data(d)
    d['timestamp'] = ['min', 'max']
    df = df.groupby(by="datafile").agg(d)
    df.columns = df.columns.map('_'.join)
    df['duration'] = df.loc[:,'timestamp_max'] - df.loc[:,'timestamp_min']
    df = df[df.duration > pd.Timedelta('10 min')]  # you ain't working out if it's less than 10 minutes!
    df['duration_s'] = df['duration'] / np.timedelta64(1, 's')
    df['time'] = df['timestamp_min'].astype(int) / 10**9
    return df
def plot_splom(df, cols, size=None):
    # plot some generic sploms!
    if size is None:
        size = (len(cols)*2, len(cols)*2)
    axes = scatter_matrix(df[cols], alpha=0.2, figsize=size, diagonal='kde')
    corr = df[cols].corr().to_numpy()  # let's show correlations on each subplot
    for i, j in zip(*np.triu_indices_from(axes, k=1)):
        axes[i, j].annotate("%.3f" % corr[i, j], (0.8, 0.8), xycoords='axes fraction', ha='center', va='center')
    return axes
Let's start with a heart rate SPLOM.
hr_cols = {'heart_rate': ['min', 'max', 'mean','std']}
hr_splom_cols = ['time','duration_s','heart_rate_min','heart_rate_max','heart_rate_mean','heart_rate_std']
hr_sessions = summarize_data(hr_cols)
axes = plot_splom(hr_sessions, hr_splom_cols);
axes[3,4].patch.set_facecolor((1, 1, .66))
axes[0,3].patch.set_facecolor((1, .9, .9))
axes[0,4].patch.set_facecolor((1, .9, .9))
Nothing too interesting here except a positive correlation between heart rate max and mean (highlighted in yellow). That's not too surprising, but it's interesting that heart rate min is less correlated. There are also some moderate correlations between time (the date and time of the exercise) and heart rate, indicating that Chris has been working his heart somewhat harder as time has passed (highlighted in red).
Let's look at the distance SPLOM.
distance_cols = {'distance': ['max']}
distance_splom_cols = ['time','duration_s','distance_max']
distance_sessions = summarize_data(distance_cols)
axes = plot_splom(distance_sessions, distance_splom_cols);
axes[1,2].patch.set_facecolor((1, 1, .66))
OK, I wasn't expecting something like this to show up, but it appears we got a V in one of our subplots (highlighted in yellow). This is because Chris has been doing both running and cycling, which happen at different speeds, and that difference is captured by this specific plot in the SPLOM (distance vs. duration). Since we're talking about speed, let's see if anything shows up in the speed SPLOM.
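The V shape is easy to reproduce with synthetic data: if each session runs at one of two typical speeds, the sessions fall on two rays from the origin in a distance-vs-duration plot. A quick sketch with made-up speeds (roughly 3 m/s for running and 8 m/s for cycling; both numbers are my own guesses, not from the data):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 200
durations = rng.uniform(600, 7200, n)                    # session lengths in seconds
# each session is either a "run" (~3 m/s) or a "ride" (~8 m/s)
speeds = rng.choice([3.0, 8.0], n) + rng.normal(0, 0.3, n)
distances = durations * speeds                           # distance = speed * duration

# two rays with different slopes fan out from the origin - the "V"
plt.scatter(durations, distances, alpha=0.4)
plt.xlabel("duration (s)")
plt.ylabel("distance (m)")
plt.show()
```

The slope of each ray is just the activity's average speed, which is why mixing two activity types produces the two-armed pattern.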
Let's look at the speed SPLOM then!
speed_cols = {'speed': ['mean','max','std']}
speed_splom_cols = ['time', 'duration_s', 'speed_max', 'speed_mean', 'speed_std']
speed_sessions = summarize_data(speed_cols)
axes=plot_splom(speed_sessions, speed_splom_cols);
axes[0,2].patch.set_facecolor((1, 1, .66))
It was a bit surprising not to spot any clear trends or clusters here at first, but looking at the speed_max plots, it seems we have an outlier with a max speed of 7744. If we removed it, we might get a more normal speed_max to compare - but it might also end up looking very similar to speed_mean, where there wasn't really anything interesting.
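If we did want to drop that outlier before re-plotting, a simple threshold filter on the per-session summary would do. A minimal sketch on a hypothetical frame shaped like the speed_sessions frame above (the values and the 30 m/s cutoff are my own assumptions, not from the data):

```python
import pandas as pd

# hypothetical per-session summary resembling speed_sessions above
sessions = pd.DataFrame({
    "speed_max":  [12.4, 9.8, 7744.0, 11.1],
    "speed_mean": [3.1, 2.9, 3.3, 3.0],
})

# keep sessions with a plausible top speed; 7744 is almost certainly a sensor glitch
filtered = sessions[sessions["speed_max"] < 30]
print(len(filtered))  # → 3
```

The same filter could be applied to speed_sessions before calling plot_splom to see whether any structure emerges once the outlier is gone.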
With the convenience functions, we could keep generating more SPLOMs, but I'll stop here. I'd like to move on to time series with the heart rate data, since I'm interested in seeing whether Chris changed his exercise routines.
Heart rate data is something I'm quite interested in looking into - when I used to exercise, my coach had a strict heart rate range I had to stay in. Let's see if Chris' heart rate data has any patterns!
I would like to analyze something related to the heart rate zone.

(heart rate zone chart image from an external source)
While Chris isn't sure what his maximum heart rate is, he suggested that perhaps 180 is a reasonable value.
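With 180 bpm as the assumed maximum, the zone boundaries from the chart work out as simple percentages of it:

```python
MAX_HR = 180  # Chris' assumed maximum heart rate
zone_percents = [50, 60, 70, 80, 90]

# each zone starts at the given percentage of the max heart rate
boundaries = {p: MAX_HR * p / 100 for p in zone_percents}
print(boundaries)  # → {50: 90.0, 60: 108.0, 70: 126.0, 80: 144.0, 90: 162.0}
```

So, for example, anything above 162 bpm would land in the 90%+ zone under this assumption.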
Let's begin by plotting heart rate against time. We'll take all rows in the strava file with a non-NaN value for heart_rate. I also want to resample the data to per-second intervals and interpolate the missing values.
def load_heart_rate_data(datafile=[]):
    df = pd.read_csv("assets/strava.csv", parse_dates=['timestamp'], infer_datetime_format=True)
    df = df[['timestamp','datafile','heart_rate']].dropna(subset=['heart_rate'])
    df.set_index('timestamp', inplace=True)
    df = df[df['datafile'].isin(datafile)]
    df = df.resample('s').interpolate()
    df.drop(columns=['datafile'], inplace=True)
    return df
# picking a random datafile to see how things look first
# I picked `activities/2770194670.fit.gz` because it has a good variance of heart rate data for the zones
picked_datafile = "activities/2770194670.fit.gz"
hr_df = load_heart_rate_data([picked_datafile])
fig = go.Figure()
fig.add_trace(go.Scatter(x=hr_df.index, y=hr_df.heart_rate))
fig.show()
Next, let's try to fill in the zones with the right colors. I'm going to define max_hr and zones variables holding Chris' max heart rate and the zone ranges from above.
MAX_HR = 180
DEFAULT_ZONES = {0:'lightgrey', 50:'silver', 60:'mediumturquoise', 70:'yellowgreen', 80:'gold', 90:'crimson'}
def zone_bins(zones=DEFAULT_ZONES, max_hr=MAX_HR):
    bin_values = [max_hr*z/100 for z in zones.keys()] + [np.inf]
    bin_names = ["{}%-{}%".format(a, b) if b != np.inf else "{}%+".format(a)
                 for (a, b) in zip(list(zones.keys()), list(zones.keys())[1:]+[np.inf])]
    return (bin_values, bin_names)
def load_heart_rate_data_with_zone(datafile=[],
                                   max_hr=MAX_HR,
                                   zones=DEFAULT_ZONES):
    bin_values, bin_names = zone_bins(zones, max_hr)
    df = load_heart_rate_data(datafile)
    df['hr_binned'] = pd.cut(df.heart_rate, bin_values, labels=bin_names)
    df['hr_zone_color'] = pd.cut(df.heart_rate, bin_values, labels=zones.values())
    return df
hrz_df = load_heart_rate_data_with_zone([picked_datafile])
fig = go.Figure()
for color, hrzc_df in hrz_df.groupby('hr_zone_color'):
    if len(hrzc_df) == 0:
        continue
    fig.add_trace(go.Bar(x=hrzc_df.index,
                         y=hrzc_df.heart_rate,
                         name=hrzc_df.iloc[0].hr_binned,
                         marker={'color': hrzc_df.hr_zone_color, 'line': {'width': 0}}))
fig.add_trace(go.Scatter(x=hrz_df.index, y=hrz_df.heart_rate,
                         line=dict(color='darkgrey', width=1), showlegend=False))
fig.update_layout(
    title = 'Heart Rate Zones',
    xaxis = dict(title='Date and Time'),
    yaxis = dict(title='Heart Rate (bpm)'),
    bargap=0
)
fig.show()
Alright! That took a lot longer than expected to figure out, but the end result is what I kind of wanted. I'm not sure about the very narrow bars though; maybe a rolling mean will help remove some of that.
hrz_df = load_heart_rate_data_with_zone([picked_datafile])
hrz_df.heart_rate = hrz_df.heart_rate.rolling('20s').mean()
fig = go.Figure()
for color, hrzc_df in hrz_df.groupby('hr_zone_color'):
    if len(hrzc_df) == 0:
        continue
    fig.add_trace(go.Bar(x=hrzc_df.index,
                         y=hrzc_df.heart_rate,
                         name=hrzc_df.iloc[0].hr_binned,
                         marker={'color': hrzc_df.hr_zone_color, 'line': {'width': 0}}))
fig.add_trace(go.Scatter(x=hrz_df.index, y=hrz_df.heart_rate,
                         line=dict(color='darkgrey', width=1), showlegend=False))
fig.update_layout(
    title = 'Heart Rate Zones with 20s Rolling Mean',
    xaxis = dict(title='Date and Time'),
    yaxis = dict(title='Heart Rate (bpm)'),
    bargap=0
)
fig.show()
OK, I guess no dice on the narrow bars. I'm going to keep the 20s rolling mean though, because I think it strikes a good balance between smoothness and keeping enough detail.
There's another chart I want to show though - how much time Chris spends in each heart rate zone during his exercises. I think a bar chart could do that justice.
def plot_hr_zone_distribution(df):
    fig = go.Figure(go.Bar(
        x=df, y=df.index, orientation='h',
        marker_color=list(DEFAULT_ZONES.values())
    ))
    fig.update_layout(
        title = 'Heart Rate Zones Distribution',
        xaxis = dict(title='Time in Zone (minutes)'),
        yaxis = dict(title='Heart Rate Zone'),
    )
    fig.show()
plot_hr_zone_distribution(hrz_df.hr_binned.value_counts(sort=False)/60)
Finally, I want to compare the distribution of heart rate zones across sessions. I think a heatmap can reveal any patterns.
df = pd.read_csv("assets/strava.csv", parse_dates=['timestamp'], infer_datetime_format=True)
df = df[['timestamp','datafile','heart_rate']].dropna(subset=['heart_rate'])
df.set_index('timestamp', inplace=True)
bin_values, bin_names = zone_bins(DEFAULT_ZONES, MAX_HR)
all_df = []
for datafile, adf in df.groupby('datafile'):
    time = adf.index[0].strftime('%B %d, %Y, %r')
    adf = adf.resample('s').interpolate()
    adf['hr_binned'] = pd.cut(adf.heart_rate, bin_values, labels=bin_names)
    adf = adf.groupby(['hr_binned']).count().reset_index()
    adf['percent'] = 100 * adf['heart_rate'] / adf['heart_rate'].sum()
    adf['timestamp'] = time
    adf['datafile'] = datafile
    all_df += [adf]
resampled_df = pd.concat(all_df)
fig = go.Figure(data=go.Heatmap(z=resampled_df.percent, x=resampled_df.timestamp, y=resampled_df.hr_binned,
                                hoverongaps=False, colorbar=dict(title='% Time in Zone')))
fig.update_layout(autosize=False,
    title = 'Heart Rate Zones Distribution Across Sessions',
    xaxis = dict(title='Session', showticklabels=False),
    yaxis = dict(title='Heart Rate Zone')
)
fig.show()
Here we can see that Chris' exercise pattern has changed over time. He is spending more time in the 80%-90% range in his later sessions compared to earlier ones, where more time was spent in the 60%-70% range.
There's also something really alarming about his exercise on August 17th: he spent 62% of his time (19 minutes) in the 90%+ zone! I don't think you're supposed to stay in the 90%+ zone that long, as it can be dangerous! I hope that either his max heart rate isn't actually 180, or something was wrong with the sensor that day...
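As a quick sanity check on those numbers: if 19 minutes is 62% of the session, the whole session lasted only about half an hour, which makes such a sustained 90%+ effort even more suspicious:

```python
minutes_in_zone = 19        # time reported in the 90%+ zone
fraction_of_session = 0.62  # share of the session spent there
session_minutes = minutes_in_zone / fraction_of_session
print(round(session_minutes, 1))  # → 30.6
```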
display(resampled_df[resampled_df['timestamp'] == "August 17, 2019, 11:50:36 AM"])
Let's pull up the earlier bar chart with the exercise data from this session!
hrz_df2 = load_heart_rate_data_with_zone(["activities/2786247269.fit.gz"])
plot_hr_zone_distribution(hrz_df2.hr_binned.value_counts(sort=False)/60)